fix: Correct the number of pruned/matched Parquet pages#22031
fix: Correct the number of pruned/matched Parquet pages#22031nuno-faria wants to merge 1 commit intoapache:mainfrom
Conversation
| Plan with Metrics | ||
| 01)SortExec: TopK(fetch=3), expr=[species@0 ASC NULLS LAST], preserve_partitioning=[false], filter=[species@0 < Nlpine Sheep], metrics=[output_rows=3] | ||
| 02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=6 total → 6 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)] | ||
| 02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/explain_analyze/data.parquet]]}, projection=[species, s], file_type=parquet, predicate=species@0 > M AND s@1 >= 50 AND DynamicFilter [ species@0 < Nlpine Sheep ], pruning_predicate=species_null_count@1 != row_count@2 AND species_max@0 > M AND s_null_count@4 != row_count@2 AND s_max@3 >= 50 AND species_null_count@1 != row_count@2 AND species_min@5 < Nlpine Sheep, required_guarantees=[], metrics=[output_rows=3, files_ranges_pruned_statistics=1 total → 1 matched, row_groups_pruned_statistics=4 total → 3 matched -> 1 fully matched, row_groups_pruned_bloom_filter=3 total → 3 matched, page_index_pages_pruned=3 total → 3 matched, limit_pruned_row_groups=0 total → 0 matched, scan_efficiency_ratio=22.13% (521/2.35 K)] |
There was a problem hiding this comment.
The file used contains 4 row groups (1 data page each) and the query matched only 3. So the previous 6 was due to double counting by having two predicates.
Also the same reasoning applies to the remaining changes.
2010YOUY01
left a comment
There was a problem hiding this comment.
LGTM, thank you!
For performance considerations: it does not appear to have any performance issues for now. However, if we want to reduce the overhead of metric updates in the future, we could change this metric to a ratio (for rows passing page pruning), which can be cheaply derived from overall_selection.
There was a problem hiding this comment.
Thanks @nuno-faria, I checked the change and it looks like a real bug and correct fix. I asked Codex to review and it did point out a related issue, but that can be fixed in a followup if you prefer.
| predicate.predicate_expr(), | ||
| ); | ||
|
|
||
| total_pages_in_group = pages.len(); |
There was a problem hiding this comment.
Codex suggested:
If every predicate skips due to a missing column index on line 239, the total_pages_in_group remains at 0 and matched_pages_in_group remains None. This causes the row group to silently contribute 0 total, 0 matched, and 0 pruned to the metrics, even though N pages will actually be scanned. While this behavior is not a new regression, it is an inaccuracy worth addressing while modifying this part of the codebase.
To fix this, total_pages_in_group should be derived from the offset index upfront as a property of the row group rather than within the predicate loop. By initializing matched_pages_in_group to (0..total_pages_in_group).collect(), an abstaining pruner will correctly report Option<HashSet<_>> to be simplified into a plain HashSet<_> and removes the redundant assignment of total_pages_in_group inside the loop.
There was a problem hiding this comment.
I thought returning 0 was expected, as it meant that the page index was not used. This is the same behavior as the number of rows.
If it's not supposed to be 0 I think it's better to leave as a follow up, since the number of rows also needs to be fixed. Any thoughts @2010YOUY01?
There was a problem hiding this comment.
Yes this is the same behavior as other metrics. I've found this behavior confusing. This seemed like a case where there's no good reason for this to be the case. I do agree it should be a followup.
Which issue does this PR close?
EXPLAIN ANALYZEreturns a wrong number of pruned and matched Parquet pages #22030.Rationale for this change
Show the correct metrics in the execution plan.
What changes are included in this PR?
Are these changes tested?
Yes.
Are there any user-facing changes?
No.